Downloading Hidden Web Content
Authors
Abstract
An ever-increasing amount of information on the Web today is available only through search interfaces: users have to type a set of keywords into a search form in order to access pages from certain Web sites. These pages are often referred to as the Hidden Web or the Deep Web. Since there are no static links to Hidden Web pages, search engines cannot discover and index them, and thus do not return them in search results. However, according to recent studies, the content provided by many Hidden Web sites is often of very high quality and can be extremely valuable to many users. In this paper, we study how to build an effective Hidden Web crawler that can autonomously discover and download pages from the Hidden Web. Since the only “entry point” to a Hidden Web site is a query interface, the main challenge a Hidden Web crawler faces is how to automatically generate meaningful queries to issue to the site. Here, we provide a theoretical framework for investigating the query generation problem for the Hidden Web, and we propose effective policies for generating queries automatically. Our policies proceed iteratively, issuing a different query in every iteration. We experimentally evaluate the effectiveness of these policies on four real Hidden Web sites, and our results are very promising. For instance, in one experiment, one of our policies downloaded more than 90% of a Hidden Web site (containing 14 million documents) after issuing fewer than 100 queries.
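To make the iterative process concrete, the sketch below shows one simple instance of such a policy. It is a minimal illustration rather than the paper's exact algorithm: the `search_site` helper, the seed keyword, and the raw frequency-based keyword ranking are assumptions made for this example (the paper proposes and compares several more refined selection policies).

```python
import re
from collections import Counter

def search_site(keyword):
    """Hypothetical stand-in for the site's search interface: submits
    `keyword` to the query form and yields (doc_id, text) pairs for the
    matching result pages."""
    raise NotImplementedError

def crawl_hidden_web(seed_keyword, max_queries=100):
    """Issue queries iteratively, choosing each new keyword from the
    documents downloaded so far (a simple frequency-based policy)."""
    downloaded = {}           # doc_id -> document text
    issued = set()
    keyword = seed_keyword
    for _ in range(max_queries):
        issued.add(keyword)
        for doc_id, text in search_site(keyword):
            downloaded[doc_id] = text
        # Rank candidate keywords by how often they appear in the
        # documents retrieved so far, skipping keywords already issued.
        counts = Counter(
            word
            for text in downloaded.values()
            for word in re.findall(r"[a-z]+", text.lower())
        )
        candidates = [w for w, _ in counts.most_common() if w not in issued]
        if not candidates:
            break
        keyword = candidates[0]
    return downloaded
```

Each iteration enlarges the crawler's local document sample, which in turn supplies the vocabulary from which the next query is drawn.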
Similar Resources
Topic-Sensitive Hidden-Web Crawling
A constantly growing amount of high-quality information is stored in pages coming from the Hidden Web. Such pages are accessible only through a query interface that a Hidden-Web site provides and may span a variety of topics. In order to provide centralized access to the Hidden Web, previous works have focused on query generation techniques that aim at downloading all content of a given Hidden ...
Hidden-Web Induced by Client-Side Scripting: An Empirical Study
Client-side JavaScript is increasingly used for enhancing web application functionality, interactivity, and responsiveness. Through the execution of JavaScript code in browsers, the DOM tree representing a webpage at runtime can be incrementally updated without requiring a URL change. This dynamically updated content is hidden from general search engines. In this paper, we present the first em...
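To see why such content is invisible to a classical crawler, one can compare the server-returned HTML with the browser-rendered DOM. The sketch below is only an illustration, not from the paper: the URL is a placeholder, and the `requests` and Selenium libraries are assumptions made for the example.

```python
import requests
from selenium import webdriver

url = "https://example.com/app"  # placeholder URL (assumption)

# What a classical crawler indexes: the HTML returned by the server,
# before any client-side JavaScript runs.
static_html = requests.get(url).text

# What a user's browser actually shows: the DOM after scripts have
# executed and possibly injected new content without a URL change.
driver = webdriver.Chrome()
driver.get(url)
rendered_dom = driver.page_source
driver.quit()

# Content present in the rendered DOM but absent from the static HTML
# is exactly the client-side hidden web the study examines.
print(len(static_html), len(rendered_dom))
```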
On the Image Content of a Web Segment: Chile as a Case Study
We propose a methodology to characterize the image contents of a web segment, and we present an analysis of the contents of a segment of the Chilean web (.CL domain). Our framework uses an efficient web-crawling architecture, standard content-based analysis tools (to extract low-level features such as color, shape and texture), and novel skin and face detection algorithms. In an automated proce...
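As a concrete instance of the low-level color features mentioned, the sketch below computes a coarse RGB color histogram with Pillow. It is only illustrative: the paper's actual feature extractors are not specified here, and `path` is a placeholder for an image fetched by the crawler.

```python
from PIL import Image

def color_histogram(path, bins_per_channel=8):
    """Coarse RGB color histogram: a standard low-level content-based
    feature for characterizing image collections."""
    im = Image.open(path).convert("RGB")
    hist = [0] * (bins_per_channel ** 3)
    step = 256 // bins_per_channel  # width of each color bin
    for r, g, b in im.getdata():
        idx = ((r // step) * bins_per_channel ** 2
               + (g // step) * bins_per_channel
               + (b // step))
        hist[idx] += 1
    total = im.width * im.height
    # Normalize so images of different sizes are comparable.
    return [h / total for h in hist]
```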
User's Interests Navigation Model Based on Hidden Markov Model
To find users' frequent interest navigation patterns, we combine information from web content and web server logs and mine the web data to build a hidden Markov model. In our approach, we first build a hidden Markov model from page content and the web server log, and then present a new incremental discovery algorithm, Hmm_R, to discover interest navigation patterns. ...
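The log-mining step can be sketched as follows. Note the simplification: this estimates a plain first-order Markov chain over observed pages, whereas the paper builds a hidden Markov model that also incorporates page content, and its incremental algorithm Hmm_R is not reproduced here; the session data is invented for illustration.

```python
from collections import defaultdict

def transition_probabilities(sessions):
    """Estimate page-to-page transition probabilities from navigation
    sessions extracted from a web server log (a first-order Markov
    chain, a simplification of the paper's HMM)."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for src, dst in zip(session, session[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }

# Invented example sessions (page identifiers are hypothetical).
sessions = [["home", "sports", "news"], ["home", "news"], ["sports", "news"]]
print(transition_probabilities(sessions))
```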
Crawling the Hidden Web (Extended Abstract)
Current-day crawlers retrieve content from the publicly indexable Web, i.e., the set of web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration. In particular, they ignore the tremendous amount of high quality content “hidden” behind search forms, in large searchable electronic databases. Our work provides a frame...
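The "publicly indexable Web" described here is what a plain breadth-first, link-following crawler reaches. The sketch below is a minimal illustration (the seed URL would be supplied by the caller, and the `requests` and BeautifulSoup libraries are assumptions): it follows only anchor links, so anything reachable solely through a search form is never visited, which is precisely the content a hidden-web crawler targets.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed, max_pages=100):
    """Breadth-first crawl of the publicly indexable Web: follow <a href>
    links only. Pages behind search forms or logins are never reached."""
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable pages
        pages[url] = resp.text
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```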